{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bagging\n",
    "\n",
    "In this notebook we introduce a very natural strategy to build ensembles of\n",
    "machine learning models, named \"bagging\".\n",
    "\n",
    "\"Bagging\" stands for Bootstrap AGGregatING. It uses bootstrap resampling\n",
    "(random sampling with replacement) to learn several models on random\n",
    "variations of the training set. At predict time, the predictions of each\n",
    "learner are aggregated to give the final predictions.\n",
    "\n",
    "We first create a simple synthetic dataset to better understand bootstrapping."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def generate_data(n_samples=30):\n",
    "    \"\"\"Generate synthetic dataset. Returns `data_train`, `data_test`,\n",
    "    `target_train`.\"\"\"\n",
    "    x_min, x_max = -3, 3\n",
    "    rng = np.random.default_rng(1)  # Create a random number generator\n",
    "    x = rng.uniform(x_min, x_max, size=n_samples)\n",
    "    noise = 4.0 * rng.normal(size=(n_samples,))\n",
    "    y = x**3 - 0.5 * (x + 1) ** 2 + noise\n",
    "    y /= y.std()\n",
    "\n",
    "    data_train = pd.DataFrame(x, columns=[\"Feature\"])\n",
    "    data_test = pd.DataFrame(\n",
    "        np.linspace(x_max, x_min, num=300), columns=[\"Feature\"]\n",
    "    )\n",
    "    target_train = pd.Series(y, name=\"Target\")\n",
    "\n",
    "    return data_train, data_test, target_train"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "data_train, data_test, target_train = generate_data(n_samples=30)\n",
    "sns.scatterplot(\n",
    "    x=data_train[\"Feature\"], y=target_train, color=\"black\", alpha=0.5\n",
    ")\n",
    "_ = plt.title(\"Synthetic regression dataset\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "The target to predict is a non-linear function of the only feature. However, a\n",
    "decision tree is capable of approximating such a non-linear dependency:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "tree = DecisionTreeRegressor(max_depth=3, random_state=0)\n",
    "tree.fit(data_train, target_train)\n",
    "y_pred = tree.predict(data_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remember that the term \"test\" here refers to data that was not used for\n",
    "training and computing an evaluation metric on such a synthetic test set would\n",
    "be meaningless."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.scatterplot(\n",
    "    x=data_train[\"Feature\"], y=target_train, color=\"black\", alpha=0.5\n",
    ")\n",
    "plt.plot(data_test[\"Feature\"], y_pred, label=\"Fitted tree\")\n",
    "plt.legend()\n",
    "_ = plt.title(\"Predictions by a single decision tree\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "source": [
    "Let's see how we can use bootstrapping to learn several trees.\n",
    "\n",
    "## Bootstrap resampling\n",
    "\n",
    "Bootstrapping involves uniformly resampling `n` data points from a dataset of\n",
    "`n` points, with replacement, ensuring each sample has an equal chance of\n",
    "selection.\n",
    "\n",
    "As a result, the output of the bootstrap sampling procedure is another dataset\n",
    "with `n` data points, likely containing duplicates. Consequently, some data\n",
    "points from the original dataset may not be selected for a bootstrap sample.\n",
    "These unselected data points are often referred to as the out-of-bag sample.\n",
    "\n",
    "We now create a function that, given `data` and `target`, returns a\n",
    "resampled variation `data_bootstrap` and `target_bootstrap`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def bootstrap_sample(data, target, seed=0):\n",
    "    # Indices corresponding to a sampling with replacement of the same sample\n",
    "    # size than the original data\n",
    "    rng = np.random.default_rng(seed)\n",
    "    bootstrap_indices = rng.choice(\n",
    "        np.arange(target.shape[0]),\n",
    "        size=target.shape[0],\n",
    "        replace=True,\n",
    "    )\n",
    "    # In pandas, we need to use `.iloc` to extract rows using an integer\n",
    "    # position index:\n",
    "    data_bootstrap = data.iloc[bootstrap_indices]\n",
    "    target_bootstrap = target.iloc[bootstrap_indices]\n",
    "    return data_bootstrap, target_bootstrap"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "We generate 3 bootstrap samples and qualitatively check the difference\n",
    "with the original dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_bootstraps = 3\n",
    "for bootstrap_idx in range(n_bootstraps):\n",
    "    # draw a bootstrap from the original data\n",
    "    data_bootstrap, target_bootstrap = bootstrap_sample(\n",
    "        data_train,\n",
    "        target_train,\n",
    "        seed=bootstrap_idx,  # ensure bootstrap samples are different but reproducible\n",
    "    )\n",
    "    plt.figure()\n",
    "    plt.scatter(\n",
    "        data_bootstrap[\"Feature\"],\n",
    "        target_bootstrap,\n",
    "        color=\"tab:blue\",\n",
    "        facecolors=\"none\",\n",
    "        alpha=0.5,\n",
    "        label=\"Resampled data\",\n",
    "        s=180,\n",
    "        linewidth=5,\n",
    "    )\n",
    "    plt.scatter(\n",
    "        data_train[\"Feature\"],\n",
    "        target_train,\n",
    "        color=\"black\",\n",
    "        s=60,\n",
    "        alpha=1,\n",
    "        label=\"Original data\",\n",
    "    )\n",
    "    plt.title(f\"Resampled data #{bootstrap_idx}\")\n",
    "    plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Observe that the 3 variations all share common points with the original\n",
    "dataset. Some of the points are randomly resampled several times and appear as\n",
    "darker blue circles.\n",
    "\n",
    "The 3 generated bootstrap samples are all different from the original dataset\n",
    "and from each other. To confirm this intuition, we can check the number of\n",
    "unique samples in the bootstrap samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_train_huge, data_test_huge, target_train_huge = generate_data(\n",
    "    n_samples=100_000\n",
    ")\n",
    "data_bootstrap_sample, target_bootstrap_sample = bootstrap_sample(\n",
    "    data_train_huge, target_train_huge\n",
    ")\n",
    "\n",
    "ratio_unique_sample = (\n",
    "    np.unique(data_bootstrap_sample).size / data_bootstrap_sample.size\n",
    ")\n",
    "print(\n",
    "    \"Percentage of samples present in the original dataset: \"\n",
    "    f\"{ratio_unique_sample * 100:.1f}%\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "On average, roughly 63.2% of the original data points of the original dataset\n",
    "are present in a given bootstrap sample. Since the bootstrap sample has the\n",
    "same size as the original dataset, there are many samples that are in the\n",
    "bootstrap sample multiple times.\n",
    "\n",
    "Using bootstrap we are able to generate many datasets, all slightly different.\n",
    "We can fit a decision tree for each of these datasets and they all shall be\n",
    "slightly different as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bag_of_trees = []\n",
    "for bootstrap_idx in range(n_bootstraps):\n",
    "    tree = DecisionTreeRegressor(max_depth=3, random_state=0)\n",
    "\n",
    "    data_bootstrap_sample, target_bootstrap_sample = bootstrap_sample(\n",
    "        data_train, target_train, seed=bootstrap_idx\n",
    "    )\n",
    "    tree.fit(data_bootstrap_sample, target_bootstrap_sample)\n",
    "    bag_of_trees.append(tree)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Now that we created a bag of different trees, we can use each of the trees to\n",
    "predict the samples within the range of data. They shall give slightly\n",
    "different predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.scatterplot(\n",
    "    x=data_train[\"Feature\"], y=target_train, color=\"black\", alpha=0.5\n",
    ")\n",
    "for tree_idx, tree in enumerate(bag_of_trees):\n",
    "    tree_predictions = tree.predict(data_test)\n",
    "    plt.plot(\n",
    "        data_test[\"Feature\"],\n",
    "        tree_predictions,\n",
    "        linestyle=\"--\",\n",
    "        alpha=0.8,\n",
    "        label=f\"Tree #{tree_idx} predictions\",\n",
    "    )\n",
    "\n",
    "plt.legend()\n",
    "_ = plt.title(\"Predictions of trees trained on different bootstraps\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Aggregating\n",
    "\n",
    "Once our trees are fitted, we are able to get predictions from each of them. In\n",
    "regression, the most straightforward way to combine those predictions is just\n",
    "to average them: for a given test data point, we feed the input feature values\n",
    "to each of the `n` trained models in the ensemble and as a result compute `n`\n",
    "predicted values for the target variable. The final prediction of the ensemble\n",
    "for the test data point is the average of those `n` values.\n",
    "\n",
    "We can plot the averaged predictions from the previous example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.scatterplot(\n",
    "    x=data_train[\"Feature\"], y=target_train, color=\"black\", alpha=0.5\n",
    ")\n",
    "\n",
    "bag_predictions = []\n",
    "for tree_idx, tree in enumerate(bag_of_trees):\n",
    "    tree_predictions = tree.predict(data_test)\n",
    "    plt.plot(\n",
    "        data_test[\"Feature\"],\n",
    "        tree_predictions,\n",
    "        linestyle=\"--\",\n",
    "        alpha=0.8,\n",
    "        label=f\"Tree #{tree_idx} predictions\",\n",
    "    )\n",
    "    bag_predictions.append(tree_predictions)\n",
    "\n",
    "bag_predictions = np.mean(bag_predictions, axis=0)\n",
    "plt.plot(\n",
    "    data_test[\"Feature\"],\n",
    "    bag_predictions,\n",
    "    label=\"Averaged predictions\",\n",
    "    linestyle=\"-\",\n",
    ")\n",
    "plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n",
    "_ = plt.title(\"Predictions of bagged trees\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "The continuous red line shows the averaged predictions, which would be the final\n",
    "predictions given by our 'bag' of decision tree regressors. Note that the\n",
    "predictions of the ensemble is more stable because of the averaging operation.\n",
    "As a result, the bag of trees as a whole is less likely to overfit than the\n",
    "individual trees.\n",
    "\n",
    "## Bagging in scikit-learn\n",
    "\n",
    "Scikit-learn implements the bagging procedure as a **meta-estimator**, that is,\n",
    "an estimator that wraps another estimator: it takes a base model that is\n",
    "cloned several times and trained independently on each bootstrap sample.\n",
    "\n",
    "The following code snippet shows how to build a bagging ensemble of decision\n",
    "trees. We set `n_estimators=100` instead of 3 in our manual implementation\n",
    "above to get a stronger smoothing effect."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import BaggingRegressor\n",
    "\n",
    "bagged_trees = BaggingRegressor(\n",
    "    estimator=DecisionTreeRegressor(max_depth=3),\n",
    "    n_estimators=100,\n",
    ")\n",
    "_ = bagged_trees.fit(data_train, target_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us visualize the predictions of the ensemble on the same interval of data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.scatterplot(\n",
    "    x=data_train[\"Feature\"], y=target_train, color=\"black\", alpha=0.5\n",
    ")\n",
    "\n",
    "bagged_trees_predictions = bagged_trees.predict(data_test)\n",
    "plt.plot(data_test[\"Feature\"], bagged_trees_predictions)\n",
    "\n",
    "_ = plt.title(\"Predictions from a bagging regressor\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because we use 100 trees in the ensemble, the average prediction is indeed\n",
    "slightly smoother but very similar to our previous average plot.\n",
    "\n",
    "It is possible to access the internal models of the ensemble stored as a\n",
    "Python list in the `bagged_trees.estimators_` attribute after fitting.\n",
    "\n",
    "Let us compare the based model predictions with their average:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for tree_idx, tree in enumerate(bagged_trees.estimators_):\n",
    "    label = \"Predictions of individual trees\" if tree_idx == 0 else None\n",
    "    # we convert `data_test` into a NumPy array to avoid a warning raised in scikit-learn\n",
    "    tree_predictions = tree.predict(data_test.to_numpy())\n",
    "    plt.plot(\n",
    "        data_test[\"Feature\"],\n",
    "        tree_predictions,\n",
    "        linestyle=\"--\",\n",
    "        alpha=0.1,\n",
    "        color=\"tab:blue\",\n",
    "        label=label,\n",
    "    )\n",
    "\n",
    "sns.scatterplot(\n",
    "    x=data_train[\"Feature\"], y=target_train, color=\"black\", alpha=0.5\n",
    ")\n",
    "\n",
    "bagged_trees_predictions = bagged_trees.predict(data_test)\n",
    "plt.plot(\n",
    "    data_test[\"Feature\"],\n",
    "    bagged_trees_predictions,\n",
    "    color=\"tab:orange\",\n",
    "    label=\"Predictions of ensemble\",\n",
    ")\n",
    "_ = plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We used a low value of the opacity parameter `alpha` to better appreciate the\n",
    "overlap in the prediction functions of the individual trees. Such\n",
    "visualization also gives us an intuition on the variance in the predictions\n",
    "across different zones of the feature space.\n",
    "\n",
    "## Bagging complex pipelines\n",
    "\n",
    "Even if here we used a decision tree as a base model, nothing prevents us from\n",
    "using any other type of model.\n",
    "\n",
    "As we know that the original data generating function is a noisy polynomial\n",
    "transformation of the input variable, let us try to fit a bagged polynomial\n",
    "regression pipeline on this dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import Ridge\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "from sklearn.pipeline import make_pipeline\n",
    "\n",
    "\n",
    "polynomial_regressor = make_pipeline(\n",
    "    MinMaxScaler(),\n",
    "    PolynomialFeatures(degree=4, include_bias=False),\n",
    "    Ridge(alpha=1e-10),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This pipeline first scales the data to the 0-1 range using `MinMaxScaler`. It\n",
    "then generates degree-4 polynomial features. By design, these features remain\n",
    "in the 0-1 range, as any power of `x` within this range also stays within 0-1.\n",
    "\n",
    "Then the pipeline feeds the resulting non-linear features to a regularized\n",
    "linear regression model for the final prediction of the target variable.\n",
    "\n",
    "Note that we intentionally use a small value for the regularization parameter\n",
    "`alpha` as we expect the bagging ensemble to work well with slightly overfit\n",
    "base models.\n",
    "\n",
    "The ensemble itself is simply built by passing the resulting pipeline as the\n",
    "`estimator` parameter of the `BaggingRegressor` class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "bagging = BaggingRegressor(\n",
    "    estimator=polynomial_regressor,\n",
    "    n_estimators=100,\n",
    "    random_state=0,\n",
    ")\n",
    "_ = bagging.fit(data_train, target_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for i, regressor in enumerate(bagging.estimators_):\n",
    "    # we convert `data_test` into a NumPy array to avoid a warning raised in scikit-learn\n",
    "    regressor_predictions = regressor.predict(data_test.to_numpy())\n",
    "    base_model_line = plt.plot(\n",
    "        data_test[\"Feature\"],\n",
    "        regressor_predictions,\n",
    "        linestyle=\"--\",\n",
    "        alpha=0.2,\n",
    "        label=\"Predictions of base models\" if i == 0 else None,\n",
    "        color=\"tab:blue\",\n",
    "    )\n",
    "\n",
    "sns.scatterplot(\n",
    "    x=data_train[\"Feature\"], y=target_train, color=\"black\", alpha=0.5\n",
    ")\n",
    "bagging_predictions = bagging.predict(data_test)\n",
    "plt.plot(\n",
    "    data_test[\"Feature\"],\n",
    "    bagging_predictions,\n",
    "    color=\"tab:orange\",\n",
    "    label=\"Predictions of ensemble\",\n",
    ")\n",
    "plt.ylim(target_train.min(), target_train.max())\n",
    "plt.legend()\n",
    "_ = plt.title(\"Bagged polynomial regression\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The predictions of this bagged polynomial regression model looks qualitatively\n",
    "better than the bagged trees. This is somewhat expected since the base model\n",
    "better reflects our knowledge of the true data generating process.\n",
    "\n",
    "Again the different shades induced by the overlapping blue lines let us\n",
    "appreciate the uncertainty in the prediction of the bagged ensemble.\n",
    "\n",
    "To conclude this notebook, we note that the bootstrapping procedure is a\n",
    "generic tool of statistics and is not limited to build ensemble of machine\n",
    "learning models. The interested reader can learn more on the [Wikipedia\n",
    "article on\n",
    "bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics))."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}